# Breaking Safety Alignment in Large Vision-Language Models via Benign-to-Harmful Optimization

This repository contains the official code and datasets for our ICLR 2026 submission:
**"Breaking Safety Alignment in Large Vision-Language Models via Benign-to-Harmful Optimization"**

We propose Benign-to-Harmful (B2H) optimization, a new jailbreak paradigm that decouples conditioning and targets (i.e., the target is not the next-token continuation of the conditioning).
This process requires modifying the model’s forward function, and we provide the corresponding code in this repository.
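
To make the distinction concrete, the toy sketch below builds (conditioning, target) pairs both ways. It only illustrates the definition in the parenthesis above and is not the repository's implementation: the function names, the one-target-per-prefix pairing, and the placeholder tokens are all assumptions made for illustration.

```python
def continuation_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Standard setup: each prefix is paired with the token that actually follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def decoupled_pairs(conditioning: list[str], targets: list[str]) -> list[tuple[list[str], str]]:
    """Decoupled setup: prefixes come from one sequence, targets from another.

    Assumption for illustration only: the prefix of length i is paired with
    targets[(i - 1) % len(targets)].
    """
    if not targets:
        raise ValueError("need at least one target token")
    return [
        (conditioning[:i], targets[(i - 1) % len(targets)])
        for i in range(1, len(conditioning) + 1)
    ]


conditioning = ["c0", "c1", "c2"]   # placeholder conditioning tokens
targets = ["<t0>", "<t1>"]          # placeholder target tokens

print(continuation_pairs(conditioning))
print(decoupled_pairs(conditioning, targets))
```

In the first list every target is the next token of its own prefix; in the second, no target is taken from the conditioning sequence at all.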

---

## 🔧 Environment Settings

### ⚙️ LLaVA-1.5 Setup

```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

conda create -n llava-15 python=3.10 -y
conda activate llava-15

pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"

pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
pip install omegaconf decord opencv-python fairscale spacy pycocoevalcap
pip install numpy==1.26.4
```
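Then replace the installed file
**/home/user/anaconda3/envs/llava-15/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py**
with the provided **modeling_llama.py**. The path above is an example; adjust the prefix to match where your conda environment lives.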
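
Rather than typing the path by hand, the file can be located from within the active environment. The following is a minimal sketch: `locate_module_file` and `replace_with_backup` are hypothetical helpers that are not part of this repository, and the commented usage assumes the patched file sits in the current directory.

```python
import importlib.util
import shutil
from pathlib import Path


def locate_module_file(module_name: str) -> Path:
    """Return the source file of an importable module in the active environment."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        spec = None  # a parent package (e.g. transformers) is not installed
    if spec is None or spec.origin is None:
        raise FileNotFoundError(f"cannot locate module: {module_name}")
    return Path(spec.origin)


def replace_with_backup(target: Path, patched: Path) -> Path:
    """Copy `patched` over `target`, saving the original as `<name>.orig` first."""
    backup = target.with_name(target.name + ".orig")
    if not backup.exists():  # never overwrite an earlier backup
        shutil.copy2(target, backup)
    shutil.copy2(patched, target)
    return backup


# Intended use inside the llava-15 environment:
#   target = locate_module_file("transformers.models.llama.modeling_llama")
#   replace_with_backup(target, Path("modeling_llama.py"))
```

Keeping the `.orig` copy makes it easy to restore the stock `transformers` behavior later.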

---

### ⚙️ InstructBLIP Setup

```bash
conda create -n i-blip python=3.10 -y
conda activate i-blip

git clone https://github.com/salesforce/LAVIS.git
cd LAVIS

pip install -e .
pip install omegaconf decord opencv-python fairscale spacy pycocoevalcap
pip install numpy==1.26.4
```
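
Both setups pin `numpy==1.26.4`, and the LLaVA-1.5 setup also pins `flash-attn==2.6.3`. As a quick sanity check after installation, a sketch like the one below reports any pin that is not satisfied in the active environment. `check_pins` is a hypothetical helper and is not shipped with this repository.

```python
from importlib import metadata


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for every pin that is not satisfied."""
    problems = {}
    for package, wanted in pins.items():
        try:
            found = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if found != wanted:
            problems[package] = f"found {found}, expected {wanted}"
    return problems


# Pins taken from the setup commands; flash-attn applies to llava-15 only.
LLAVA_PINS = {"numpy": "1.26.4", "flash-attn": "2.6.3"}
INSTRUCTBLIP_PINS = {"numpy": "1.26.4"}

for package, problem in check_pins(INSTRUCTBLIP_PINS).items():
    print(f"{package}: {problem}")
```

An empty result means every listed pin matches; run it with `LLAVA_PINS` inside the `llava-15` environment.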

---

## 🧨 Jailbreak Image Optimization

We support two attack modes:

- **Benign-to-Harmful (B2H)**  
- **Harmful-Continuation (Baseline)** 

### 🔸 InstructBLIP 

```bash
conda activate i-blip  
python B2H_jailbreak_instructblip.py
```

### 🔸 LLaVA-1.5 

```bash
conda activate llava-15
python B2H_jailbreak_llava_15.py
```

---

### 📌 Common Arguments

| Argument        | Description |
|-----------------|-------------|
| `--gpu_id`      | GPU index to use (default: 0) |
| `--n_iters`     | Number of attack steps (default: 5001) |
| `--eps`         | Attack budget (epsilon) (default: 16) |
| `--alpha`       | Attack step size (default: 1) |
| `--constrained` | Enable L-infinity constraint (optional flag) |
| `--batch_size`  | Batch size (default: 1) |
| `--ours`        | Set to `True` for **Benign-to-Harmful**, `False` for **Harmful-Continuation** |
| `--th`          | Tau threshold hyperparameter (default: 0.1) |
| `--save_dir`    | Directory to save adversarial images and logs |
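
The table can be read as the following `argparse` configuration. This is a sketch reconstructed from the table, not the parser used in the scripts: the argument types, the `str2bool` helper, and the defaults for `--ours` and `--save_dir` (which the table does not state) are assumptions.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False' strings (plain type=bool treats any non-empty string as True)."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Jailbreak image optimization (sketch)")
    parser.add_argument("--gpu_id", type=int, default=0, help="GPU index to use")
    parser.add_argument("--n_iters", type=int, default=5001, help="number of attack steps")
    parser.add_argument("--eps", type=int, default=16, help="attack budget (epsilon)")
    parser.add_argument("--alpha", type=int, default=1, help="attack step size")
    parser.add_argument("--constrained", action="store_true", help="enable L-infinity constraint")
    parser.add_argument("--batch_size", type=int, default=1, help="batch size")
    parser.add_argument("--th", type=float, default=0.1, help="tau threshold")
    # Placeholder defaults below: the table does not state them.
    parser.add_argument("--ours", type=str2bool, default=True,
                        help="True: Benign-to-Harmful, False: Harmful-Continuation")
    parser.add_argument("--save_dir", type=str, default="output",
                        help="directory for adversarial images and logs")
    return parser


# Example: baseline mode with a larger budget and the L-infinity constraint on.
args = build_parser().parse_args(["--constrained", "--ours", "False", "--eps", "32"])
print(args.ours, args.eps, args.constrained)
```

Assuming the scripts accept the flags as listed, the same settings would be passed on the command line as `--constrained --ours False --eps 32`.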

---

## 📊 Evaluation

After generating adversarial images, prepare the datasets described below, then run the evaluation scripts to compute attack success rates and unsafe scores.

### 📂 Dataset Preparation

This repository supports evaluation across five standard safety benchmarks:

| Dataset Name             | Source                                                                 | Loading Method                      |
|--------------------------|------------------------------------------------------------------------|-------------------------------------|
| `real-toxicity-prompts` | [RealToxicityPrompts (AllenAI)](https://github.com/allenai/real-toxicity-prompts) | Load from local `.jsonl` file      |
| `jailbreakbench`        | [JailbreakBench (HuggingFace)](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | `load_dataset` from HuggingFace    |
| `AdvBench`              | [AdvBench (HuggingFace)](https://huggingface.co/datasets/walledai/AdvBench) | `load_dataset` from HuggingFace    |
| `HarmBench`             | [HarmBench CSV (included or external)](https://github.com/centerforaisafety/HarmBench/blob/main/data/behavior_datasets/harmbench_behaviors_text_all.csv)                                   | Load from local `.csv` file         |
| `StrongREJECT`          | Provided locally as `.json`                                            | Load from local `.json` file        |

---

### 📥 How to Prepare

1. **Create a `datasets/` folder** in the project root if not already present.
2. Dataset paths can be adjusted inside:
   - `instructblip_inference.py`
   - `llava_v15_inference.py`

   Look for the `load_dataset_by_name()` function and modify local paths as needed.
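
The three local formats listed in the table (`.jsonl`, `.csv`, `.json`) can all be read with the standard library. The sketch below shows one way to dispatch on the file extension. `load_local_records` is a hypothetical helper, not the repository's `load_dataset_by_name()`, and it makes no assumption about the field names inside each file.

```python
import csv
import json
from pathlib import Path


def load_local_records(path: str) -> list:
    """Load a local benchmark file into a list of records, dispatching on the extension."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix == ".jsonl":  # one JSON object per non-empty line
        with file.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    if suffix == ".csv":  # the header row becomes the dict keys
        with file.open(encoding="utf-8", newline="") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".json":  # a single object is wrapped so callers always get a list
        with file.open(encoding="utf-8") as handle:
            data = json.load(handle)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported dataset file: {path}")
```

The Hub-hosted sets are loaded with `datasets.load_dataset` instead; their configuration and split names are dataset-specific and are not covered by this sketch.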


### 🔹 InstructBLIP Evaluation

```bash
bash eval_and_get_metrics_instructblip.sh
```

### 🔹 LLaVA-1.5 Evaluation

```bash
bash eval_and_get_metrics_llava_15.sh
```

These scripts evaluate outputs using:
- Detoxify
- LLaMA Guard 3
- Perspective API (optional)
- GPT-4o (optional)
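
The exact metric definitions live in the evaluation scripts and are not restated here. As a rough sketch of the general idea, an attack success rate can be computed as the fraction of responses that a judge flags as unsafe, here by thresholding a per-response score in [0, 1]. The function names and the 0.5 threshold are assumptions, and the scores are made up.

```python
def attack_success_rate(unsafe_scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of responses whose unsafe score is at or above `threshold`."""
    if not unsafe_scores:
        raise ValueError("no scores to aggregate")
    flagged = sum(score >= threshold for score in unsafe_scores)
    return flagged / len(unsafe_scores)


def mean_unsafe_score(unsafe_scores: list[float]) -> float:
    """Average unsafe score across responses."""
    if not unsafe_scores:
        raise ValueError("no scores to aggregate")
    return sum(unsafe_scores) / len(unsafe_scores)


scores = [0.9, 0.2, 0.7, 0.1]  # made-up scores for illustration
print(attack_success_rate(scores))  # 0.5 (two of four scores reach the threshold)
print(mean_unsafe_score(scores))
```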

For GPT-4o evaluation, set your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY='Your-OpenAI-Key'
```
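
To fail fast when the key is missing, a check like the following can run before the GPT-4o step. This is a small sketch; `require_env` is a hypothetical helper, not something the evaluation scripts are known to call.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the GPT-4o evaluation")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the start of a run surfaces a missing key immediately instead of after inference has finished.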

---

## 📁 Data Description

| File | Purpose |
|------|---------|
| `harmful_corpus/benign_sentences.csv` | Benign prompts used in B2H attacks (authored) |
| `harmful_corpus/harmful_words.csv` | Target harmful phrases for B2H (authored) |
| `harmful_corpus/derogatory_corpus.csv` | Baseline harmful prompts from prior Harmful-Continuation work |
| `output/` | Adversarial images and prompts will be saved here |

See [`harmful_corpus/README.md`](./harmful_corpus/README.md) for ethical use details.

---

## ⚠️ Ethical Considerations

This repository contains redacted adversarial examples and masked harmful content to enable safety research in LVLMs.

- All materials are intended for research only.  
- Toxic outputs are filtered, masked, or redacted by default.  
- Reuse of this code should follow responsible AI and institutional ethics guidelines.

> **Disclaimer:** The authors do not endorse the views expressed in any of the data or outputs.  
> This project is released solely for the purpose of improving AI safety.

---

## 📌 Citation

Coming soon (upon acceptance to ICLR 2026).

---

## 🔗 License

MIT License (see `LICENSE` file).
